We leveraged The COVID Tracking Project to analyze the impact of COVID on the United States. The COVID Tracking Project is utilized by many leading and trusted groups, such as the CDC, HHS, and Johns Hopkins.
Using the simple read.csv command, we load the data from the Data folder of our Term Project GitHub repository.
us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily7.19.csv", header = T, strip.white = T)
str(us)
## 'data.frame': 179 obs. of 25 variables:
## $ date : int 20200718 20200717 20200716 20200715 20200714 20200713 20200712 20200711 20200710 20200709 ...
## $ states : int 56 56 56 56 56 56 56 56 56 56 ...
## $ positive : int 3692061 3626881 3549648 3478695 3413313 3350434 3291969 3230991 3167984 3101339 ...
## $ negative : int 41273443 40576852 39802297 39048662 38351244 37653841 36990207 36323867 35751437 34994707 ...
## $ pending : int 3032 3002 2929 2947 960 2610 2639 2618 2493 2530 ...
## $ hospitalizedCurrently : int 57562 57705 57369 56143 55533 53988 52632 51834 51577 43924 ...
## $ hospitalizedCumulative : int 276439 274436 271758 269543 267127 264865 263651 262712 257571 255253 ...
## $ inIcuCurrently : int 6396 6453 6359 6317 6235 6074 5930 5939 5896 5845 ...
## $ inIcuCumulative : int 12342 12243 12091 12002 11857 11749 11679 11612 11523 11370 ...
## $ onVentilatorCurrently : int 2343 2353 2317 2317 2263 2259 2182 2169 2197 2127 ...
## $ onVentilatorCumulative : int 1211 1200 1175 1166 1154 1142 1136 1128 1118 1138 ...
## $ recovered : int 1122720 1106844 1090645 1075882 1049098 1031939 1006326 995576 983185 969111 ...
## $ dateChecked : Factor w/ 179 levels "2020-01-22T00:00:00Z",..: 179 178 177 176 175 174 173 172 171 170 ...
## $ death : int 132395 131523 130572 129598 128740 128004 127677 127201 126444 125590 ...
## $ hospitalized : int 276439 274436 271758 269543 267127 264865 263651 262712 257571 255253 ...
## $ lastModified : Factor w/ 179 levels "2020-01-22T00:00:00Z",..: 179 178 177 176 175 174 173 172 171 170 ...
## $ total : int 44968536 44206735 43354874 42530304 41765517 41006885 40284815 39557476 38921914 38098576 ...
## $ totalTestResults : int 44965504 44203733 43351945 42527357 41764557 41004275 40282176 39554858 38919421 38096046 ...
## $ posNeg : int 44965504 44203733 43351945 42527357 41764557 41004275 40282176 39554858 38919421 38096046 ...
## $ deathIncrease : int 872 951 974 858 736 327 476 757 854 867 ...
## $ hospitalizedIncrease : int 2003 2678 2215 2416 2262 1214 939 5141 2318 1719 ...
## $ negativeIncrease : int 696591 774555 753635 697418 697403 663634 666340 572430 756730 639952 ...
## $ positiveIncrease : int 65180 77233 70953 65382 62879 58465 60978 63007 66645 58836 ...
## $ totalTestResultsIncrease: int 761771 851788 824588 762800 760282 722099 727318 635437 823375 698788 ...
## $ hash : Factor w/ 179 levels "0063e380f01c09a74ff18ab1940b36390ef365d9",..: 118 30 96 170 111 66 61 116 131 5 ...
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
head(us)
## date states positive negative pending hospitalizedCurrently
## 1 2020-07-18 56 3692061 41273443 3032 57562
## 2 2020-07-17 56 3626881 40576852 3002 57705
## 3 2020-07-16 56 3549648 39802297 2929 57369
## 4 2020-07-15 56 3478695 39048662 2947 56143
## 5 2020-07-14 56 3413313 38351244 960 55533
## 6 2020-07-13 56 3350434 37653841 2610 53988
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 1 276439 6396 12342
## 2 274436 6453 12243
## 3 271758 6359 12091
## 4 269543 6317 12002
## 5 267127 6235 11857
## 6 264865 6074 11749
## onVentilatorCurrently onVentilatorCumulative recovered
## 1 2343 1211 1122720
## 2 2353 1200 1106844
## 3 2317 1175 1090645
## 4 2317 1166 1075882
## 5 2263 1154 1049098
## 6 2259 1142 1031939
## dateChecked death hospitalized lastModified total
## 1 2020-07-18T00:00:00Z 132395 276439 2020-07-18T00:00:00Z 44968536
## 2 2020-07-17T00:00:00Z 131523 274436 2020-07-17T00:00:00Z 44206735
## 3 2020-07-16T00:00:00Z 130572 271758 2020-07-16T00:00:00Z 43354874
## 4 2020-07-15T00:00:00Z 129598 269543 2020-07-15T00:00:00Z 42530304
## 5 2020-07-14T00:00:00Z 128740 267127 2020-07-14T00:00:00Z 41765517
## 6 2020-07-13T00:00:00Z 128004 264865 2020-07-13T00:00:00Z 41006885
## totalTestResults posNeg deathIncrease hospitalizedIncrease
## 1 44965504 44965504 872 2003
## 2 44203733 44203733 951 2678
## 3 43351945 43351945 974 2215
## 4 42527357 42527357 858 2416
## 5 41764557 41764557 736 2262
## 6 41004275 41004275 327 1214
## negativeIncrease positiveIncrease totalTestResultsIncrease
## 1 696591 65180 761771
## 2 774555 77233 851788
## 3 753635 70953 824588
## 4 697418 65382 762800
## 5 697403 62879 760282
## 6 663634 58465 722099
## hash
## 1 a6838d287f4a1270b49fde1e76e01bab4e2d7b2e
## 2 2fd5fa9462df840f6808d6757e4d13f702be8fdb
## 3 8644aacf2b4997ff08daf87ae088c7f6a44cb367
## 4 ef1be31b81541c1ff634f7adff6a5cebe14bab3e
## 5 9fff665c0a4f59af0ca8746fa5d90e91b35c0967
## 6 5b8ab51eb57a48502d534a29f1e5d62791ac8a24
During this project we will be trying to surface the metrics that properly evaluate the severity of COVID in the US. To simplify our data set, we will remove all variables that do not directly link to COVID metrics, as well as variables that depict the same information, e.g., totalTestResults and posNeg.
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
head(us)
## date positive negative pending hospitalizedCurrently
## 1 2020-07-18 3692061 41273443 3032 57562
## 2 2020-07-17 3626881 40576852 3002 57705
## 3 2020-07-16 3549648 39802297 2929 57369
## 4 2020-07-15 3478695 39048662 2947 56143
## 5 2020-07-14 3413313 38351244 960 55533
## 6 2020-07-13 3350434 37653841 2610 53988
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 1 276439 6396 12342
## 2 274436 6453 12243
## 3 271758 6359 12091
## 4 269543 6317 12002
## 5 267127 6235 11857
## 6 264865 6074 11749
## onVentilatorCurrently onVentilatorCumulative recovered death
## 1 2343 1211 1122720 132395
## 2 2353 1200 1106844 131523
## 3 2317 1175 1090645 130572
## 4 2317 1166 1075882 129598
## 5 2263 1154 1049098 128740
## 6 2259 1142 1031939 128004
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 1 44965504 872 2003 696591
## 2 44203733 951 2678 774555
## 3 43351945 974 2215 753635
## 4 42527357 858 2416 697418
## 5 41764557 736 2262 697403
## 6 41004275 327 1214 663634
## positiveIncrease
## 1 65180
## 2 77233
## 3 70953
## 4 65382
## 5 62879
## 6 58465
Upon evaluating the columns that contain NA (not available) values, we can see that in the early dates of data collection the US had a low count of COVID cases. Since it is likely that those metrics were actually zero, as they were not even being collected at that time, we will replace all NAs with zeros.
us[is.na(us)] <- 0
tail(us)
## date positive negative pending hospitalizedCurrently
## 174 2020-01-27 2 0 0 0
## 175 2020-01-26 2 0 0 0
## 176 2020-01-25 2 0 0 0
## 177 2020-01-24 2 0 0 0
## 178 2020-01-23 2 0 0 0
## 179 2020-01-22 2 0 0 0
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 174 0 0 0
## 175 0 0 0
## 176 0 0 0
## 177 0 0 0
## 178 0 0 0
## 179 0 0 0
## onVentilatorCurrently onVentilatorCumulative recovered death
## 174 0 0 0 0
## 175 0 0 0 0
## 176 0 0 0 0
## 177 0 0 0 0
## 178 0 0 0 0
## 179 0 0 0 0
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 174 2 0 0 0
## 175 2 0 0 0
## 176 2 0 0 0
## 177 2 0 0 0
## 178 2 0 0 0
## 179 2 0 0 0
## positiveIncrease
## 174 0
## 175 0
## 176 0
## 177 0
## 178 0
## 179 0
All variables show variance; therefore, we will keep all of them in the model for now.
skim(us)
## Skim summary statistics
## n obs: 179
## n variables: 17
##
## ── Variable type:Date ──────────────────────────────────────────────────────────
## variable missing complete n min max median n_unique
## date 0 179 179 2020-01-22 2020-07-18 2020-04-20 179
##
## ── Variable type:integer ───────────────────────────────────────────────────────
## variable missing complete n mean sd
## deathIncrease 0 179 179 739.64 736.87
## hospitalizedIncrease 0 179 179 1544.35 1972.24
## negative 0 179 179 9542260.32 1.2e+07
## negativeIncrease 0 179 179 230577.89 228080.46
## positive 0 179 179 1e+06 1080318.9
## positiveIncrease 0 179 179 20626.03 18316.97
## totalTestResults 0 179 179 1.1e+07 1.3e+07
## p0 p25 p50 p75 p100 hist
## 0 3.5 655 1216 2740 ▇▂▃▂▁▂▁▁
## -2841 0 1245 2258.5 17320 ▁▇▂▁▁▁▁▁
## 0 2578 3269209 1.7e+07 4.1e+07 ▇▂▁▁▁▁▁▁
## -3 626 136639 414259.5 774555 ▇▂▂▂▂▁▂▁
## 2 1161 786933 1866669.5 3692061 ▇▂▂▂▂▁▁▁
## 0 203.5 21590 28924.5 77233 ▇▂▇▂▁▁▁▁
## 2 3739 4056142 1.9e+07 4.5e+07 ▇▂▁▁▁▁▁▁
##
## ── Variable type:numeric ───────────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## death 0 179 179 50519.59 49837.79 0 26.5
## hospitalizedCumulative 0 179 179 107636.91 1e+05 0 6
## hospitalizedCurrently 0 179 179 26293.04 21887.94 0 0
## inIcuCumulative 0 179 179 4237.38 4434.51 0 0
## inIcuCurrently 0 179 179 5476.62 5033.25 0 0
## onVentilatorCumulative 0 179 179 384.97 405.83 0 0
## onVentilatorCurrently 0 179 179 2346.8 2259.85 0 0
## pending 0 179 179 6483.64 14080.45 0 330
## recovered 0 179 179 271316.65 343207.89 0 0
## p50 p75 p100 hist
## 39028 1e+05 132395 ▇▁▁▁▁▂▂▂
## 97369 215272 276439 ▇▁▁▁▂▁▃▂
## 30890 44527 59538 ▇▁▁▂▃▃▂▃
## 2193 8737.5 12342 ▇▂▁▁▂▂▂▂
## 5597 9069 15130 ▇▁▃▃▂▂▂▂
## 214 720 1211 ▇▂▁▁▂▂▁▂
## 2172 4501.5 7070 ▇▁▃▁▂▂▂▁
## 2186 3576 65709 ▇▁▁▁▁▁▁▁
## 69493 563494.5 1122720 ▇▂▁▁▁▁▁▁
Below we will use a basic line graph by day to see the general trend of some of the basic cumulative variables provided by our raw data set. The cumulative variables are interesting for seeing the general trend of the data. However, they would not be beneficial for a forecasting initiative, because they will simply continue to trend upward as the virus spreads. They give no real insight that could be used for COVID relief efforts for hospitals.
We can see that the total cumulative COVID cases are increasing over time with what appears to be no reprieve.
library(ggplot2)
library(ggthemes)
ggplot(data = us, aes(x=date, y=positive))+
geom_line(color="black")+
labs(title = "Cumulative COVID Cases US", y = "Millions", x = "") +
theme_fivethirtyeight()
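As a quick sanity-check sketch (our addition, not part of the original cleaning), the daily-increase columns should agree with first differences of the cumulative series once the data is sorted oldest to newest:

```r
# Sketch: after sorting oldest-to-newest, first differences of the
# cumulative positive count should line up with positiveIncrease
# (barring any upstream data corrections).
usAsc <- us[order(us$date), ]
all.equal(diff(usAsc$positive), usAsc$positiveIncrease[-1])
```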
Total hospitalizations due to COVID also appear to increase over time. We do see a little more movement in this graph; however, it is still clearly increasing overall.
ggplot(data = us, aes(x=date, y=hospitalizedCumulative))+
geom_line(color="orange")+
labs(title = "Cumulative COVID Hospitalized Cases US", y = "Millions", x = "") +
theme_fivethirtyeight()
In an attempt to take a more micro view of the COVID data, we review some of the new and cumulative variables below. We are looking to surface the variables that will be the most beneficial in forecasting the severity of cases, and therefore hospital resources, moving forward.
When we start to investigate at a more micro level and review something like daily new cases, we see more trending behavior and something that can be more properly evaluated. Newly confirmed cases have a little more information to tell us from a trend perspective. We can see some pseudo-cyclic behavior alongside some heavy wandering behavior.
ggplot(data = us, aes(x=date, y=positiveIncrease))+
geom_line(color="black")+
labs(title = "New COVID Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
When measuring the severity of cases, we wanted to make sure that we are including the most severe cases; we believe that a good indicator of this would be a hospitalization metric. This is not only because the most severe cases end up hospitalized, but also because, given the initial fear around resources and space, it would be a key forecast for medical communities preparing for a potential next round of patients.
Research suggests ‘that most people who contract the new coronavirus develop mild cases of Covid-19,’ per an article released in June that can be found here. Given the likelihood of asymptomatic patients, hospitalization metrics reveal themselves to be a more accurate representation of severe cases.
Originally we were going to evaluate the new COVID hospitalization cases. However, upon closer review it became clear that large states that were hit the hardest were failing to report this metric. We confirmed this when we compared the New versus Current metrics: there is a large spike in hospitalization cases overall, but the same results are not reflected in the New COVID cases metric. Therefore, we err on the side of caution and move forward with modeling our Currently Hospitalized COVID metric.
Looking at a more micro view of currently hospitalized COVID cases shows us much more behavior that we might be able to evaluate moving forward. Below we can see what appears to be heavy wandering behavior and if we look closely, some additional noise that could be pseudo-cyclic behavior hidden by the large numbers.
ggplot(data = us, aes(x=date, y=hospitalizedIncrease))+
geom_line(color="orange")+
labs(title = "New COVID Hospitalized Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
geom_line(color="orange")+
labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
When we review the daily new death count, we again see what appears to be pseudo-cyclic behavior with perhaps some slight wandering tendencies.
ggplot(data = us, aes(x=date, y=deathIncrease))+
geom_line(color="dark red")+
labs(title = "New COVID Death Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
In order to review the data at an even smaller level, we have created the positive percent metric as well as hospitalization, death, and ICU rates among positive cases.
Next we review the daily positivity rate, a metric that has been more widely reported lately. We see the positivity rate actually decreasing over time. We would expect this outcome, since we also saw a drastic increase in testing over time.
While this metric is better than just new cases for forecasting severity (it normalizes to the number of tests), it doesn’t tell us anything about impact. We tend toward the currently hospitalized metric because it allows us to take into consideration things such as medical resources and why we, as a nation, were attempting to flatten the curve.
us$posRate <- us$positive / us$totalTestResults
ggplot(data = us, aes(x=date, y=posRate))+
geom_line(color="black")+
labs(title = "Daily Positivity Rate COVID Testing US", y = "Rate", x = "") +
theme_fivethirtyeight()
When reviewing rates we also chose to observe the trend of the hospitalization rate of positively tested COVID patients. We see the hospitalization rate start very low, then spike to around 15%. The hospitalization rate seems to hover in the low teens for a few months and then appears to begin to decrease.
While reviewing this data we determined that this was not the best metric for forecasting the severity of cases moving forward. The reason is that individuals who were hospitalized for reasons other than COVID, but then returned a positive test, would be listed as COVID patients even if the virus itself was not affecting them (asymptomatic). Therefore we tend toward currently hospitalized as our severity metric, since it accounts for all hospitalization resources rather than focusing on a metric that appears to be misleading at this time.
us$hospRate <- us$hospitalizedCumulative / us$positive
ggplot(data = us, aes(x=date, y=hospRate))+
geom_line(color="dark orange")+
labs(title = "Daily Hospitalization Rate COVID Testing US", y = "Rate", x = "") +
theme_fivethirtyeight()
In a pattern opposite to the previous rates we have evaluated, we actually see a sharp increase in the recovery rate over time. It appears to be plateauing recently, but we would expect it to continue to increase and provide a final percentage of infected individuals who have recovered from COVID.
At first we thought the recovery rate would be a good opposite indicator to compare against a severity indicator. However, upon closer investigation we quickly realized that only a few states were collecting any of the recovered data. Without actual numbers coming from all states, this would not be a good reflection of the overall COVID recovery rate. This data would not be good to forecast.
us$recRate <- us$recovered / us$positive
ggplot(data = us, aes(x=date, y=recRate))+
geom_line(color="dark green")+
labs(title = "Daily Recovery Rate COVID Testing US", y = "Rate", x = "") +
theme_fivethirtyeight()
Like the previous rates we have examined, we see an initial spike in the death rate and what appears to be an equally quick drop-off. With the death rate hovering right under 4%, we should also note that the flu death rate is about 1%, which puts COVID at roughly three to four times as deadly as the flu.
Since death is the most severe outcome of the COVID virus, we have chosen to model this metric as a measure of severity alongside currently hospitalized. With these two metrics we feel we will have a better picture of severity and be able to give our nation a better idea of what to expect.
us$deathRate <- us$death / us$positive
ggplot(data = us, aes(x=date, y=deathRate))+
geom_line(color="dark red")+
labs(title = "Daily Death Rate COVID Testing US", y = "Rate", x = "") +
theme_fivethirtyeight()
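The ICU rate mentioned above can be derived the same way as the other rates; a minimal sketch mirroring the hospRate calculation (the icuRate name is our own):

```r
# ICU rate among positive cases (sketch; mirrors hospRate above).
us$icuRate <- us$inIcuCumulative / us$positive
ggplot(data = us, aes(x = date, y = icuRate)) +
  geom_line(color = "dark blue") +
  labs(title = "Daily ICU Rate COVID Testing US", y = "Rate", x = "") +
  theme_fivethirtyeight()
```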
Trying to create pairs plots for all of the variables against each other would be very difficult to interpret. As suspected, there are highly correlated variables, since we are graphing things like total positive tests and new positive tests. These variables are inherently correlated with one another because they communicate similar data. Since this is the case, we will only review a scatterplot of the variables we surfaced above and will use for forecasting in the multivariate testing.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
scatter <- subset(us, select = c(positive, negative, hospitalizedCurrently, hospitalizedCumulative, death, totalTestResults, deathIncrease, hospitalizedIncrease, negativeIncrease, positiveIncrease, posRate, deathRate))
ggpairs(scatter)
In order to prep our data from our EDA, we will reload the data, transform the date, remove all zero rows from the variable we are testing (hospitalizedCurrently), and sort from oldest to newest. This will allow us to easily work through our univariate time series evaluations.
us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily.7.26.20.csv", header = T, strip.white = T)
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
us[is.na(us)] <- 0
#Selecting only those dates with reported current hospitalizations
us <- dplyr::slice(us,1:132)
us = us[order(as.Date(us$date, format = "%Y%m%d")),]
head(us)
## date positive negative pending hospitalizedCurrently
## 132 2020-03-17 11928 63104 1687 325
## 131 2020-03-18 15099 84997 2526 416
## 130 2020-03-19 19770 108407 3016 617
## 129 2020-03-20 26025 138814 3330 1042
## 128 2020-03-21 32910 177262 3468 1436
## 127 2020-03-22 42169 213476 2842 2155
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 132 55 0 0
## 131 67 0 0
## 130 85 0 0
## 129 108 0 0
## 128 2020 0 0
## 127 3023 0 0
## onVentilatorCurrently onVentilatorCumulative recovered death
## 132 0 0 0 122
## 131 0 0 0 153
## 130 0 0 0 199
## 129 0 0 0 267
## 128 0 0 0 328
## 127 0 0 0 471
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 132 75032 22 13 13707
## 131 100096 31 12 21893
## 130 128177 46 18 23410
## 129 164839 68 23 30407
## 128 210172 61 1912 38448
## 127 255645 143 1003 36214
## positiveIncrease
## 132 3613
## 131 3171
## 130 4671
## 129 6255
## 128 6885
## 127 9259
It is difficult to assume stationarity for this data for multiple reasons. We are working under the assumption that COVID is a novel virus and that cases, as well as hospitalizations, will eventually return to zero; meanwhile, our current modeling techniques do things such as return to the median or mimic previously seen trends. We also see a heavy wandering trend in new cases, and hospitalizations depend on this as well as on time. We will review the data to see what non-stationary components, if any, reveal themselves and model the data accordingly.
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
geom_line(color="orange")+
labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
ACF:
Spectral Density:
plotts.sample.wge(us$hospitalizedCurrently, lag.max = 100)
## $autplt
## [1] 1.000000000 0.967247046 0.928472445 0.883791745 0.834039153
## [6] 0.779901780 0.722099406 0.660670233 0.595507232 0.527652159
## [11] 0.459798098 0.392707355 0.325070497 0.257394633 0.190076427
## [16] 0.123107742 0.058045025 -0.005592108 -0.062510518 -0.114255855
## [21] -0.163281121 -0.206979530 -0.242176807 -0.275097030 -0.300626756
## [26] -0.323018458 -0.340862401 -0.357234304 -0.371125261 -0.380223777
## [31] -0.387733922 -0.393904174 -0.399188337 -0.403702529 -0.407535556
## [36] -0.409058093 -0.406032310 -0.402291057 -0.397307333 -0.392192952
## [41] -0.385158345 -0.377444480 -0.367999137 -0.357234818 -0.345390268
## [46] -0.333657101 -0.320405088 -0.307093151 -0.293895002 -0.279566463
## [51] -0.263105425 -0.246363949 -0.229539058 -0.213137515 -0.196728805
## [56] -0.180700291 -0.164151991 -0.145439160 -0.126569221 -0.107984741
## [61] -0.090254314 -0.072242265 -0.054130291 -0.034756123 -0.014095709
## [66] 0.007187366 0.028500485 0.048915792 0.068729196 0.088785284
## [71] 0.109636722 0.130889993 0.152506791 0.173725394 0.193391213
## [76] 0.211419137 0.228441065 0.245143327 0.260841689 0.274513731
## [81] 0.286671443 0.296525403 0.304596844 0.311148685 0.317183678
## [86] 0.321808447 0.324325398 0.323899788 0.321013182 0.315477248
## [91] 0.307881329 0.298203689 0.286925511 0.272658784 0.256479904
## [96] 0.236600281 0.213678525 0.188829126 0.163809020 0.138368627
## [101] 0.111541632
##
## $freq
## [1] 0.007575758 0.015151515 0.022727273 0.030303030 0.037878788
## [6] 0.045454545 0.053030303 0.060606061 0.068181818 0.075757576
## [11] 0.083333333 0.090909091 0.098484848 0.106060606 0.113636364
## [16] 0.121212121 0.128787879 0.136363636 0.143939394 0.151515152
## [21] 0.159090909 0.166666667 0.174242424 0.181818182 0.189393939
## [26] 0.196969697 0.204545455 0.212121212 0.219696970 0.227272727
## [31] 0.234848485 0.242424242 0.250000000 0.257575758 0.265151515
## [36] 0.272727273 0.280303030 0.287878788 0.295454545 0.303030303
## [41] 0.310606061 0.318181818 0.325757576 0.333333333 0.340909091
## [46] 0.348484848 0.356060606 0.363636364 0.371212121 0.378787879
## [51] 0.386363636 0.393939394 0.401515152 0.409090909 0.416666667
## [56] 0.424242424 0.431818182 0.439393939 0.446969697 0.454545455
## [61] 0.462121212 0.469696970 0.477272727 0.484848485 0.492424242
## [66] 0.500000000
##
## $db
## [1] 8.4297671 13.7645189 12.5249640 8.3502117 4.2830812
## [6] -0.7743908 -1.6306179 -1.6820655 -1.1397181 -1.9660002
## [11] -3.8470478 -4.5601791 -5.7979845 -6.6448260 -8.1613031
## [16] -7.7172776 -7.0289454 -6.7231288 -11.3437841 -7.3292891
## [21] -8.5570448 -9.8123688 -12.3352898 -12.0470153 -10.4188757
## [26] -11.1989248 -11.5484834 -11.3001319 -11.8583444 -13.4758586
## [31] -13.5752424 -13.1601144 -13.4822098 -11.8751077 -12.0098408
## [36] -11.6652933 -14.1857608 -18.7098443 -15.1452524 -14.2793937
## [41] -13.6312562 -13.4537203 -13.8449147 -14.8421169 -15.8144787
## [46] -16.3497508 -16.1392884 -14.5229548 -13.9096257 -14.2979673
## [51] -15.1762427 -16.9533791 -16.6240685 -15.5004175 -14.8930293
## [56] -15.3404690 -17.3904939 -15.6412620 -14.5937114 -14.5227461
## [61] -16.6299676 -17.9885318 -16.8369446 -17.2106857 -15.1218462
## [66] -13.7373278
##
## $dbz
## [1] 10.8000075 10.4225032 9.7877656 8.8879309 7.7133502
## [6] 6.2551734 4.5113555 2.5000875 0.2870472 -1.9727273
## [11] -4.0185421 -5.5981361 -6.6776527 -7.4337186 -8.0352652
## [16] -8.5435576 -8.9672493 -9.3368790 -9.7184801 -10.1759984
## [21] -10.7327082 -11.3585971 -11.9855900 -12.5448786 -13.0053273
## [26] -13.3805586 -13.7009145 -13.9838700 -14.2295004 -14.4361103
## [31] -14.6143036 -14.7849900 -14.9660787 -15.1624073 -15.3669935
## [36] -15.5703444 -15.7685492 -15.9634820 -16.1567949 -16.3451059
## [41] -16.5214784 -16.6812308 -16.8255589 -16.9589464 -17.0832210
## [46] -17.1948352 -17.2887237 -17.3653174 -17.4335644 -17.5062347
## [51] -17.5910578 -17.6848979 -17.7756255 -17.8505813 -17.9054045
## [56] -17.9463455 -17.9845791 -18.0276392 -18.0745244 -18.1172420
## [61] -18.1468027 -18.1590805 -18.1567596 -18.1470189 -18.1375821
## [66] -18.1338177
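The slowly damping ACF above is consistent with near-nonstationary behavior. As a supplementary check (our addition, assuming the tseries package is installed), an augmented Dickey-Fuller test can formalize this; its null hypothesis is a unit root, so a large p-value would be consistent with the wandering we observe:

```r
# Augmented Dickey-Fuller test on the currently-hospitalized series.
library(tseries)
adf.test(us$hospitalizedCurrently)
```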
Since we are seeing heavy wandering behavior, we will use overfit tables to see if we can surface any (1-B) factors that have roots very near the unit circle.
est.ar.wge(us$hospitalizedCurrently,p=6,type='burg')
##
## Coefficients of Original polynomial:
## 1.2113 0.0324 -0.0917 -0.0697 0.0013 -0.1010
##
## Factor Roots Abs Recip System Freq
## 1-1.9283B+0.9354B^2 1.0308+-0.0812i 0.9672 0.0125
## 1+0.9678B+0.3408B^2 -1.4200+-0.9583i 0.5838 0.4055
## 1-0.2507B+0.3168B^2 0.3957+-1.7322i 0.5628 0.2143
##
##
## $phi
## [1] 1.211269271 0.032440618 -0.091744816 -0.069675720 0.001316006
## [6] -0.100966743
##
## $res
## [1] -292.92920 42.23585 -92.70196 -102.02123 -334.70493
## [6] -145.22402 -403.96302 34.46467 -83.49135 1318.62578
## [11] 1377.16436 -880.96039 -636.85824 -375.11269 230.53004
## [16] 589.34824 -162.87147 523.10092 2268.40156 -856.96328
## [21] 1307.41009 4905.29484 -2079.33995 2125.99214 -1561.57046
## [26] -286.55877 -2910.62575 -721.73533 2310.48679 -741.05699
## [31] -1458.33820 -885.25596 -1020.67917 -806.18397 1240.82698
## [36] 3855.28027 -594.27955 -332.98691 -1561.70454 599.16915
## [41] -668.18241 910.28844 838.47931 588.30675 -699.57648
## [46] 648.28101 -290.27821 -792.40531 664.36263 1721.18828
## [51] -247.47143 -655.56939 -1027.72246 -316.65232 -555.31836
## [56] 670.18033 2208.79087 -46.31492 -741.64210 -711.16638
## [61] -345.85017 -802.33939 1055.50075 1019.45748 423.37585
## [66] -263.76408 -870.20100 -934.79262 -213.25332 762.11152
## [71] 755.17733 958.19565 -257.19616 -1107.16022 -1097.48478
## [76] -392.31146 25.12350 266.29333 -150.37710 11.46933
## [81] -24.36326 -230.98962 -270.72490 81.75412 189.17778
## [86] -227.80796 -1100.72037 -289.22665 -380.30212 -216.89601
## [91] 321.96954 641.36609 239.90315 -394.37919 -45.80287
## [96] -932.43620 101.31456 444.74934 1155.39447 225.84784
## [101] -36.98998 -963.47841 90.12262 -630.67985 649.45247
## [106] 1114.69662 348.88102 78.19229 -682.63199 -511.33045
## [111] -33.55819 480.50704 1404.41245 468.10554 -172.16756
## [116] 6696.26244 -2026.25073 -1457.43811 -294.49345 416.89154
## [121] -780.81322 714.03055 -299.92450 -595.44881 59.38061
## [126] 536.60736 999.48047 240.91788 133.31315 -233.45378
## [131] -301.48621 -282.07861
##
## $avar
## [1] 1358148
##
## $aic
## [1] 14.22769
##
## $aicc
## [1] 15.25171
##
## $bic
## [1] 14.38057
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)
acf(us.diff)
us.diff.seas = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
aic5.wge(us.diff.seas)
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 17 5 1 14.57449
## 10 3 0 14.60386
## 13 4 0 14.61477
## 6 1 2 14.61527
## 8 2 1 14.61581
aic5.wge(us.diff.seas,type = "bic")
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of bic
## p q bic
## 7 2 0 14.68775
## 10 3 0 14.69484
## 6 1 2 14.70625
## 8 2 1 14.70679
## 3 0 2 14.72173
ljung.wge(us.diff.seas)$pval
## Obs 0.2522573 0.4096169 0.2803795 0.1452644 0.1021298 0.1354335 -0.2827722 0.07264095 -0.09804601 -0.01062144 0.004067316 -0.04130786 -0.03854601 -0.02866098 -0.09193388 -0.05167819 -0.1056372 -0.1193326 -0.08827532 -0.1061183 -0.1054081 -0.036613 -0.0654078 -0.03489691
## [1] 1.859792e-06
ljung.wge(us.diff.seas, K=48)$pval
## Obs 0.2522573 0.4096169 0.2803795 0.1452644 0.1021298 0.1354335 -0.2827722 0.07264095 -0.09804601 -0.01062144 0.004067316 -0.04130786 -0.03854601 -0.02866098 -0.09193388 -0.05167819 -0.1056372 -0.1193326 -0.08827532 -0.1061183 -0.1054081 -0.036613 -0.0654078 -0.03489691 0.02370281 -0.02743256 -0.02153882 -0.03584166 -0.1039903 -0.02877203 -0.03843455 -0.07298731 -0.03237146 -0.02495354 0.008256594 0.02396111 -0.06576515 -0.04980251 -0.04736059 -0.04415211 -0.03591561 -0.04413551 -0.03016308 0.03982533 0.008384104 0.02503231 0.03614677 0.01834679
## [1] 0.003990771
est.us.diff.seasAIC = est.arma.wge(us.diff.seas, p = 5, q=1)
##
## Coefficients of Original polynomial:
## -0.6097 0.4850 0.5047 0.0643 -0.2269
##
## Factor Roots Abs Recip System Freq
## 1+0.9305B -1.0747 0.9305 0.5000
## 1+0.9405B+0.5775B^2 -0.8143+-1.0337i 0.7599 0.3562
## 1-1.2613B+0.4223B^2 1.4934+-0.3713i 0.6498 0.0388
##
##
mean(us$hospitalizedCurrently)
## [1] 39625.17
est.us.diff.seasBIC = est.arma.wge(us.diff.seas, p = 2)
##
## Coefficients of Original polynomial:
## 0.1594 0.3740
##
## Factor Roots Abs Recip System Freq
## 1-0.6964B 1.4359 0.6964 0.0000
## 1+0.5370B -1.8622 0.5370 0.5000
##
##
mean(us$hospitalizedCurrently)
## [1] 39625.17
shortARMA <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasAIC$phi, theta = est.us.diff.seasAIC$theta, d= 1, s = 7, n.ahead = 7, lastn = F, limits = T)
est.us.diff.seasAIC$aic
## [1] 14.57449
phis = est.us.diff.seasAIC$phi
thetas = est.us.diff.seasAIC$theta
trainingSize = 24
horizon = 7
ASEHolder = numeric()
for( i in 1:(124-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = horizon)
ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 39401 754593 2108050 14880281 6998084 176230307
WindowedASE
## [1] 14880281
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 7, lastn = TRUE)
ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 73605156
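Since this rolling-window ASE loop is repeated for each model and horizon below, it could be wrapped in a helper function; a sketch under the assumption that fore.aruma.wge from the tswge package is loaded (rollingASE is a hypothetical name, and it uses the actual series length rather than the hard-coded 124):

```r
# Hypothetical helper: rolling-window ASE for a seasonal ARUMA fit.
rollingASE <- function(series, phis, thetas, trainingSize, horizon,
                       s = 7, d = 1) {
  nWindows <- length(series) - (trainingSize + horizon) + 1
  ASEHolder <- numeric(nWindows)
  for (i in 1:nWindows) {
    fc <- fore.aruma.wge(series[i:(i + trainingSize - 1)],
                         phi = phis, theta = thetas, s = s, d = d,
                         n.ahead = horizon, plot = FALSE)
    actual <- series[(trainingSize + i):(trainingSize + i + horizon - 1)]
    ASEHolder[i] <- mean((actual - fc$f)^2)
  }
  ASEHolder
}
```

Each block above would then reduce to a call such as mean(rollingASE(us$hospitalizedCurrently, phis, thetas, 24, 7)).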
longARMA <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasAIC$phi, theta = est.us.diff.seasAIC$theta, d= 1, s = 7, n.ahead = 90, lastn = F, limits = F)
phis = est.us.diff.seasAIC$phi
thetas = est.us.diff.seasAIC$theta
trainingSize = 24
horizon = 90
ASEHolder = numeric()
for( i in 1:(124-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = horizon)
ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.083e+08 6.202e+08 8.410e+09 1.810e+10 2.989e+10 6.151e+10
WindowedASE
## [1] 1.8103e+10
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = 90, lastn = T)
ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 108278159
shortAR <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasBIC$phi, d=1, s=7, n.ahead = 7, lastn = FALSE, limits = FALSE)
est.us.diff.seasBIC$aic
## [1] 14.61952
phis = est.us.diff.seasBIC$phi
thetas = est.us.diff.seasBIC$theta
trainingSize = 24
horizon = 7
ASEHolder = numeric()
for( i in 1:(124-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1, n.ahead = horizon)
ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
invisible(ASEHolder)
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 77157 618880 2264284 15546758 6837785 202401118
WindowedASE
## [1] 15546758
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 7, lastn = TRUE)
ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 60809514
longAR <- fore.aruma.wge(us$hospitalizedCurrently, phi = est.us.diff.seasBIC$phi, s= 7, d = 1, n.ahead = 90, lastn = FALSE, limits = FALSE)
phis = est.us.diff.seasBIC$phi
thetas = est.us.diff.seasBIC$theta
trainingSize = 24
horizon = 90
ASEHolder = numeric()
for( i in 1:(124-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize-1))],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = horizon)
ASE = mean((us$hospitalizedCurrently[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
ASEHolder
## [1] 61052232487 54487603507 35485946674 27887241909 17356997661
## [6] 7212077636 8960477292 598137404 130355537 347397993
## [11] 185865366
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.304e+08 4.728e+08 8.960e+09 1.943e+10 3.169e+10 6.105e+10
WindowedASE
## [1] 19427666679
fs = fore.aruma.wge(us$hospitalizedCurrently[i:(i+(trainingSize+horizon)-1)],phi = phis, theta = thetas, s = 7, d = 1,n.ahead = 90, lastn = TRUE)
ASE = mean((us$hospitalizedCurrently[(i+trainingSize):(i+(trainingSize+horizon)-1)] - fs$f )^2)
ASE
## [1] 185865366
For our univariate NN model we created training and test data sets, which allows us to cross validate model performance. This first NN model uses mostly default parameters, to see how the mlp function does with few settings. However, with so little data we are also curious how leveraging all of the data changes the trend of the forecast. So, we will model the two side by side and compare the forecasts they produce.
library(nnfor)
head(us)
## date positive negative pending hospitalizedCurrently
## 132 2020-03-17 11928 63104 1687 325
## 131 2020-03-18 15099 84997 2526 416
## 130 2020-03-19 19770 108407 3016 617
## 129 2020-03-20 26025 138814 3330 1042
## 128 2020-03-21 32910 177262 3468 1436
## 127 2020-03-22 42169 213476 2842 2155
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 132 55 0 0
## 131 67 0 0
## 130 85 0 0
## 129 108 0 0
## 128 2020 0 0
## 127 3023 0 0
## onVentilatorCurrently onVentilatorCumulative recovered death
## 132 0 0 0 122
## 131 0 0 0 153
## 130 0 0 0 199
## 129 0 0 0 267
## 128 0 0 0 328
## 127 0 0 0 471
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 132 75032 22 13 13707
## 131 100096 31 12 21893
## 130 128177 46 18 23410
## 129 164839 68 23 30407
## 128 210172 61 1912 38448
## 127 255645 143 1003 36214
## positiveIncrease
## 132 3613
## 131 3171
## 130 4671
## 129 6255
## 128 6885
## 127 9259
usTrain.nn = ts(us$hospitalizedCurrently[1:125])
us.nn.fit = mlp(usTrain.nn, outplot = T, comb = "mean", m=7, reps = 50)
plot(us.nn.fit)
us.nn.fit2 = mlp(ts(us$hospitalizedCurrently), outplot = T, comb = "mean", m=7, reps = 50)
plot(us.nn.fit2)
us.nn.fit.fore7 = forecast(us.nn.fit, h=7)
plot(us.nn.fit.fore7)
us.nn.fit.fore90 = forecast(us.nn.fit, h=90)
plot(us.nn.fit.fore90)
us.nn.fit.fore2 = forecast(us.nn.fit2, h=7)
plot(us.nn.fit.fore2)
us.nn.fit.fore2 = forecast(us.nn.fit2, h=90)
plot(us.nn.fit.fore2)
plot(us$hospitalizedCurrently[126:132], type = "l", ylim = c(55000, 80000))
lines(seq(1:7), us.nn.fit.fore7$mean, col = "blue")
ASEus.nn.fit.fore7 = mean((us$hospitalizedCurrently[126:132]-us.nn.fit.fore7$mean)^2)
ASEus.nn.fit.fore7
## [1] 1777856
- 90 Day
ASEus.nn.fit.fore90 = mean((us$hospitalizedCurrently[43:132]-us.nn.fit.fore90$mean)^2)
ASEus.nn.fit.fore90
## [1] 427549097
We completed a default neural network model. With so many options for how to tune a neural network model, we knew this would not be our best model in this case. So, we moved forward with a hyperparameter-tuned neural network model for our ensemble model, which allows us to calculate many windowed ASEs and compare those models against each other.
library(tswgewrapped)
set.seed(3)
data_train.u <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:122], positiveIncrease = rnorm(122, 0, .0001))
data_test.u <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[123:132], positiveIncrease = rnorm(10, 0, .0001))
# search for best NN hyperparameters in given grid
model.u = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.u, var_interest = "hospitalizedCurrently",
search = 'random', tuneLength = 5, parallel = TRUE,
batch_size = 50, h = 7, m = 7,
verbose = 1)
## Registered S3 method overwritten by 'lava':
## method from
## print.pcor greybox
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 16, hd = 1, allow.det.season = FALSE on full training set
## - Total Time for training: : 46.573 sec elapsed
res.u <- model.u$summarize_hyperparam_results()
res.u
## reps hd allow.det.season RMSE ASE RMSESD ASESD
## 1 10 2 TRUE 4873.573 46711465 5025.507 79166325
## 2 11 3 FALSE 3840.162 31663386 4313.721 69744266
## 3 16 1 FALSE 2953.278 11858347 1857.457 10343460
## 4 16 3 TRUE 3991.706 33375228 4380.144 70312087
## 5 22 3 FALSE 3957.298 32780242 4339.590 70283659
model.u$plot_hyperparam_results()
best.u <- model.u$summarize_best_hyperparams()
final.ase.u <- dplyr::filter(res.u, reps == best.u$reps &
hd == best.u$hd &
allow.det.season == best.u$allow.det.season)[['ASE']]
final.ase.u
## [1] 11858347
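The `summarize_best_hyperparams` / `dplyr::filter` step above amounts to picking the hyperparameter row with the lowest windowed ASE. A minimal sketch of that selection in Python, using the values from the `res.u` table above:

```python
def best_by_ase(results):
    """Return the hyperparameter combination with the lowest windowed ASE."""
    return min(results, key=lambda r: r["ASE"])

# Rows transcribed from the res.u output above
res_u = [
    {"reps": 10, "hd": 2, "ASE": 46711465},
    {"reps": 11, "hd": 3, "ASE": 31663386},
    {"reps": 16, "hd": 1, "ASE": 11858347},
    {"reps": 16, "hd": 3, "ASE": 33375228},
    {"reps": 22, "hd": 3, "ASE": 32780242},
]
best = best_by_ase(res_u)  # reps = 16, hd = 1, matching the fitted final model
```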
# Ensemble / Hypertuned NN Model
caret_model.u = model.u$get_final_models(subset = 'a')
caret_model.u$finalModel
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (4)
## Forecast combined using the median operator.
## MSE: 1427611.8956.
#Plot Final Model
plot(caret_model.u$finalModel)
#Ensemble model trained data
ensemble.mlp.u1 = nnfor::mlp(usTrain.nn, outplot = T, reps = best.u$reps, hd = best.u$hd, allow.det.season = F)
ensemble.mlp.u1
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## Forecast combined using the median operator.
## MSE: 1452364.1472.
#Ensemble model
ensemble.mlp.u2 = nnfor::mlp(ts(us$hospitalizedCurrently), outplot = T, reps = best.u$reps, hd = best.u$hd, allow.det.season = F)
ensemble.mlp.u2
## MLP fit with 1 hidden node and 16 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## Forecast combined using the median operator.
## MSE: 1386442.1448.
fore7.1.u = forecast(ensemble.mlp.u1 , h=7)
#grabbing prediction intervals for 7 day forecast
all.mean7 <- data.frame(fore7.1.u$all.mean)
ranges7 <- data.frame(apply(all.mean7, MARGIN = 1, FUN = range))
subtracts7 <- ranges7 - as.list(ranges7[1,])
nintyperc7 <- data.frame(mapply(`*`,subtracts7,.9,SIMPLIFY=FALSE))
diffs7 <- data.frame(mapply(`/`,nintyperc7,2,SIMPLIFY = FALSE))
diffs7 = diffs7[-1,]
vector7 <- as.numeric(diffs7[1,])
plot(fore7.1.u)
lines(seq(126,132,1), (fore7.1.u$mean + vector7), type = "l", col = "red")
lines(seq(126,132,1), (fore7.1.u$mean - vector7), type = "l", col = "red")
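Unwinding the data-frame gymnastics above: at each forecast step, the band half-width is 90% of half the range across the repetition forecasts stored in `all.mean`. A sketch of that arithmetic in Python (the `rep_forecasts` argument is hypothetical and stands in for the rows/columns of `fore7.1.u$all.mean`):

```python
def range_band(rep_forecasts, scale=0.9):
    """Per-step interval half-widths from an ensemble of forecast paths.

    rep_forecasts: one forecast path per repetition, all of equal length.
    Returns scale * (max - min) / 2 at each forecast step.
    """
    return [scale * (max(step) - min(step)) / 2 for step in zip(*rep_forecasts)]
```

The resulting vector plays the role of `vector7` above: it is added to and subtracted from the mean forecast to draw the red band.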
fore7.2.u = forecast(ensemble.mlp.u2, h=7)
plot(fore7.2.u)
fore90.1.u = forecast(ensemble.mlp.u1, h=90)
#grabbing prediction intervals for 90 day forecast
all.mean90 <- data.frame(fore90.1.u$all.mean)
ranges90 <- data.frame(apply(all.mean90, MARGIN = 1, FUN = range))
subtracts90 <- ranges90 - as.list(ranges90[1,])
nintyperc90 <- data.frame(mapply(`*`,subtracts90,.9,SIMPLIFY=FALSE))
diffs90 <- data.frame(mapply(`/`,nintyperc90,2,SIMPLIFY = FALSE))
diffs90 = diffs90[-1,]
vector90 <- as.numeric(diffs90[1,])
plot(fore90.1.u)
lines(seq(126,215,1), (fore90.1.u$mean + vector90), type = "l", col = "red")
lines(seq(126,215,1), (fore90.1.u$mean - vector90), type = "l", col = "red")
fore90.2.u = forecast(ensemble.mlp.u2, h=90)
plot(fore90.2.u)
Upon completion of the above models, the most important takeaway is that each data point is essential in determining the trend of the COVID virus. With cross validation methods we can see a trend, but as each new data point becomes part of the model, the trend alters day by day. Moving forward it will be essential that models are updated daily in order to capture a good trend and therefore the ability to forecast hospital needs and the severity of COVID.
When investigating these models, it became clear that the 90-day forecasts were simply repeating the trend and seasonality without the kind of extrapolation we would recommend for long term forecasting. We would only recommend using the short 7-day forecast for predicting hospital equipment and staffing needs. The ensemble model had the lowest windowed ASE and is what we recommend moving forward for these short term forecasts.
In order to prep our data from our EDA, we will load the data, transform the date, remove all-zero rows from the variable we are testing (hospitalizedCurrently), and sort from oldest to newest. This will allow us to easily work through our multivariate time series evaluations.
us <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/USdaily.7.26.20.csv", header = T, strip.white = T)
us <- transform(us, date = as.Date(as.character(date), "%Y%m%d"))
us <- subset(us, select = -c(states, dateChecked, hospitalized, lastModified, total, posNeg, totalTestResultsIncrease, hash))
us[is.na(us)] <- 0
us = us[order(as.Date(us$date, format = "%Y%m%d")),]
head(us)
## date positive negative pending hospitalizedCurrently
## 187 2020-01-22 2 0 0 0
## 186 2020-01-23 2 0 0 0
## 185 2020-01-24 2 0 0 0
## 184 2020-01-25 2 0 0 0
## 183 2020-01-26 2 0 0 0
## 182 2020-01-27 2 0 0 0
## hospitalizedCumulative inIcuCurrently inIcuCumulative
## 187 0 0 0
## 186 0 0 0
## 185 0 0 0
## 184 0 0 0
## 183 0 0 0
## 182 0 0 0
## onVentilatorCurrently onVentilatorCumulative recovered death
## 187 0 0 0 0
## 186 0 0 0 0
## 185 0 0 0 0
## 184 0 0 0 0
## 183 0 0 0 0
## 182 0 0 0 0
## totalTestResults deathIncrease hospitalizedIncrease negativeIncrease
## 187 2 0 0 0
## 186 2 0 0 0
## 185 2 0 0 0
## 184 2 0 0 0
## 183 2 0 0 0
## 182 2 0 0 0
## positiveIncrease
## 187 0
## 186 0
## 185 0
## 184 0
## 183 0
## 182 0
It is difficult to assume stationarity for this data for multiple reasons. We are working under the assumption that COVID is a novel virus and that cases, as well as hospitalizations, will eventually return to zero. That said, our current modeling techniques do things such as return to the median or mimic previously seen trends. We also see a heavy wandering trend in new cases, and hospitalizations depend both on new cases and on time. We will review the data, see what non-stationary components (if any) reveal themselves, and model the data accordingly.
Traits:
ggplot(data = us, aes(x=date, y=hospitalizedCurrently))+
geom_line(color="orange")+
labs(title = "Current COVID Hospitalized Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
When reviewing the variables and the scatter plot from our previous EDA, we can see some correlations that were expected; for example, currentHospitalizations is correlated with the variables that reflect ICU and ventilator patients. These metrics wouldn't exist without hospitalizations. These variables also cannot help us predict hospitalizations because they occur after a patient is hospitalized. So, we will instead leverage variables such as positiveIncrease to see whether there is a correlation between hospitalized patients and the number of positive cases.
ggplot(data = us, aes(x=date, y=positiveIncrease))+
geom_line(color="orange")+
labs(title = "Increase in COVID Cases US", y = "Thousands", x = "") +
theme_fivethirtyeight()
Since the total positive COVID cases series requires an additional transformation, beyond the seasonal and first differences, to become stationary, we will use only the positive increase of cases to predict currently hospitalized cases for the VAR model.
#Positive Cases Transformations
#pos.diff = artrans.wge(us$positive, phi.tr = 1, lag.max = 100)
#pos.diff.2 = artrans.wge(pos.diff, phi.tr = 1, lag.max = 100)
#pos.trans = artrans.wge(pos.diff.2,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#Increase Positive Transformation
inpos.diff = artrans.wge(us$positiveIncrease, phi.tr = 1, lag.max = 100)
inpos.trans = artrans.wge(inpos.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
#Current Hospitalization Transformations
us.diff = artrans.wge(us$hospitalizedCurrently, phi.tr = 1, lag.max = 100)
currhosp.trans = artrans.wge(us.diff,phi.tr = c(0,0,0,0,0,0,1), lag.max = 100)
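The two `artrans.wge` calls on each series apply a first difference followed by a lag-7 seasonal difference, i.e. the (1-B)(1-B^7) transform. The differencing operation itself is simple; a sketch in Python:

```python
def difference(x, lag=1):
    """Lag-k difference: y_t = x_t - x_{t-k} (drops the first `lag` points)."""
    return [x[i] - x[i - lag] for i in range(lag, len(x))]

# Composing a first difference with a lag-7 seasonal difference mirrors
# artrans.wge(..., phi.tr = 1) followed by phi.tr = c(0,0,0,0,0,0,1)
def diff_1_then_7(x):
    return difference(difference(x, lag=1), lag=7)
```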
#VARselect to choose lag
VARselect(cbind(currhosp.trans, inpos.trans), lag.max = 10, type= "both")
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 7 7 2 7
##
## $criteria
## 1 2 3 4 5
## AIC(n) 3.093857e+01 3.083329e+01 3.079091e+01 3.076976e+01 3.069554e+01
## HQ(n) 3.101650e+01 3.095018e+01 3.094676e+01 3.096458e+01 3.092932e+01
## SC(n) 3.113058e+01 3.112131e+01 3.117493e+01 3.124979e+01 3.127158e+01
## FPE(n) 2.731960e+13 2.459304e+13 2.357880e+13 2.309552e+13 2.145770e+13
## 6 7 8 9 10
## AIC(n) 3.066656e+01 3.051427e+01 3.053424e+01 3.059674e+01 3.059649e+01
## HQ(n) 3.093930e+01 3.082598e+01 3.088492e+01 3.098638e+01 3.102509e+01
## SC(n) 3.133860e+01 3.128233e+01 3.139830e+01 3.155681e+01 3.165257e+01
## FPE(n) 2.086403e+13 1.793906e+13 1.833019e+13 1.955149e+13 1.959499e+13
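As a sanity check on the `$selection` row, the chosen lags can be read straight off the printed criteria: AIC is minimized at lag 7, while the more parsimonious SC (BIC) is minimized at lag 2. A quick Python verification using the values transcribed from the VARselect output above:

```python
# AIC and SC values for lags 1..10, transcribed from the VARselect output
aic = [30.93857, 30.83329, 30.79091, 30.76976, 30.69554,
       30.66656, 30.51427, 30.53424, 30.59674, 30.59649]
sc = [31.13058, 31.12131, 31.17493, 31.24979, 31.27158,
      31.33860, 31.28233, 31.39830, 31.55681, 31.65257]

best_aic_lag = min(range(len(aic)), key=aic.__getitem__) + 1  # lag 7
best_sc_lag = min(range(len(sc)), key=sc.__getitem__) + 1     # lag 2
```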
#fit model
usdiff.var.fit = VAR(cbind(currhosp.trans, inpos.trans), type = "both", p = 2)
#p = 2 chosen by SC/BIC (AIC favored p = 7 with AIC = 30.51427)
#7 day predictions
pred.var7 = predict(usdiff.var.fit, n.ahead = 7)
pred.var7$fcst$currhosp.trans[,1:3]
## fcst lower upper
## [1,] -42.12350 -2942.410 2858.163
## [2,] -385.17371 -3346.493 2576.145
## [3,] -80.65766 -3262.529 3101.214
## [4,] -124.14161 -3327.324 3079.041
## [5,] -46.62794 -3284.538 3191.282
## [6,] -43.58380 -3288.207 3201.039
## [7,] -11.23201 -3262.277 3239.813
#90 day predictions
pred.var90 = predict(usdiff.var.fit, n.ahead = 90)
invisible(pred.var90$fcst$currhosp.trans)
startingPoints7 = us$hospitalizedCurrently[126:132]
currHospForecasts7 = pred.var7$fcst$currhosp.trans[,1:3] + startingPoints7
startingPoints90 = us$hospitalizedCurrently[43:132]
currHospForecasts90 = pred.var90$fcst$currhosp.trans[,1:3] + startingPoints90
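The step above returns the differenced VAR forecasts (and their interval bounds) to the original scale by adding the recent observed levels held in `startingPoints7`/`startingPoints90`. A sketch of that reconstruction in Python (the function name is ours):

```python
def add_back_levels(diff_forecasts, reference_levels):
    """Shift forecasts made on a differenced series back to the original
    scale by adding the matching recent observed levels, element-wise."""
    return [f + lvl for f, lvl in zip(diff_forecasts, reference_levels)]
```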
#7 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")
- 90 Day
#90 day Forecasts
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,222), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "90 Day Currently Hospitalized Patients Forecast")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$fcst, type = "l", col = "red")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$upper, type = "l", col = "blue")
lines(seq(133,222,1), as.data.frame(currHospForecasts90)$lower, type = "l", col = "blue")
varASE7 = mean((us$hospitalizedCurrently[126:132]-currHospForecasts7[1:7])^2)
varASE7
## [1] 25178.55
varASE90 = mean((us$hospitalizedCurrently[43:132]-currHospForecasts90[1:90])^2)
varASE90
## [1] 10791.59
From examining additional regressors for all of our above models, we know that the best regressor to leverage is the positive increase of COVID cases. This regressor reflects behavior similar to that of current hospitalizations and can be modeled properly against hospitalizations.
1. Create Current Hospitalized ts variable
tUScurrhop = ts(us$hospitalizedCurrently)
tUSx = data.frame(positiveIncrease = ts(us$positiveIncrease))
fit.mlp.new = mlp(ts(tUSx), reps = 50, comb = "mean")
mlp.new.fore7 = forecast(fit.mlp.new, h = 7)
invisible(mlp.new.fore7)
mlp.new.fore90 = forecast(fit.mlp.new, h = 90)
invisible(mlp.new.fore90)
new.regressor7 <- data.frame(c(us$positiveIncrease, mlp.new.fore7$mean))
invisible(new.regressor7)
new.regressor90 <- data.frame(c(us$positiveIncrease, mlp.new.fore90$mean))
invisible(new.regressor90)
mlp.fit1 = mlp(tUScurrhop, xreg = tUSx, outplot = T, comb = "mean")
plot(mlp.fit1)
currhosp.fore7 = forecast(mlp.fit1, h = 7, xreg = new.regressor7)
plot(currhosp.fore7)
- Currently Hospitalized 90 Day Forecast w/ Positive Increase Regressor
currhosp.fore90 = forecast(mlp.fit1, h = 90, xreg = new.regressor90)
plot(currhosp.fore90)
#ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7$mean)^2)
#ASEmlr7
tUScurrhop2 = ts(us$hospitalizedCurrently[1:125])
tUSx2 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:125]))
fit.mlp.new2 = mlp(ts(tUSx2), reps = 50, comb = "mean")
mlp.new.fore7.2 = forecast(fit.mlp.new2, h = 7)
invisible(mlp.new.fore7.2)
new.regressor7.2 <- data.frame(c(us$positiveIncrease[1:125], mlp.new.fore7.2$mean))
invisible(new.regressor7.2)
mlp.fit1.2 = mlp(tUScurrhop2, xreg = tUSx2, comb = "mean")
#plot(mlp.fit1.2)
currhosp.fore7.2 = forecast(mlp.fit1.2, h = 7, xreg = new.regressor7.2)
#plot(currhosp.fore7.2)
ASEmlr7.2 = mean((us$hospitalizedCurrently[126:132]-currhosp.fore7.2$mean)^2)
ASEmlr7.2
## [1] 721093.4
#ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore90$mean)^2)
#ASEmlr90
tUScurrhop3 = ts(us$hospitalizedCurrently[1:42])
tUSx3 = data.frame(positiveIncrease = ts(us$positiveIncrease[1:42]))
fit.mlp.new3 = mlp(ts(tUSx3), reps = 50, comb = "mean")
mlp.new.fore7.3 = forecast(fit.mlp.new3, h = 90)
invisible(mlp.new.fore7.3)
new.regressor7.3 <- data.frame(c(us$positiveIncrease[1:42], mlp.new.fore7.3$mean))
invisible(new.regressor7.3)
mlp.fit1.3 = mlp(tUScurrhop3, xreg = tUSx3, comb = "mean")
#plot(mlp.fit1.3)
currhosp.fore7.3 = forecast(mlp.fit1.3, h = 90, xreg = new.regressor7.3)
#plot(currhosp.fore7.3)
ASEmlr7.3 = mean((us$hospitalizedCurrently[43:132]-currhosp.fore7.3$mean)^2)
ASEmlr7.3
## [1] 4562909805
We completed a default neural network model. With so many options for how to tune a neural network model, we knew this would not be our best model in this case. So we moved forward with a hyperparameter-tuned neural network model that allows us to calculate many windowed ASEs and compare those models against each other.
We have leveraged the tswgewrapped code above to produce a hyperparameter-tuned NN model for our ensemble model. This model will work as our higher-functioning ensemble.
#if (!require(devtools)) {
# install.packages("devtools")
#}
#devtools::install_github("josephsdavid/tswgewrapped", build_vignettes = TRUE)
library(tswgewrapped)
set.seed(3)
data_train.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[1:122], positiveIncrease = us$positiveIncrease[1:122])
data_test.m <- data.frame(hospitalizedCurrently = us$hospitalizedCurrently[123:132], positiveIncrease = us$positiveIncrease[123:132])
# search for best NN hyperparameters in given grid
model.m = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.m, var_interest = "hospitalizedCurrently",
search = 'random', tuneLength = 5, parallel = TRUE,
batch_size = 50, h = 7, m = 7,
verbose = 1)
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 10, hd = 2, allow.det.season = TRUE on full training set
## - Total Time for training: : 41.571 sec elapsed
res.m <- model.m$summarize_hyperparam_results()
res.m
## reps hd allow.det.season RMSE ASE RMSESD ASESD
## 1 10 2 TRUE 3103.429 13970241 2184.689 16296534
## 2 11 3 FALSE 3129.341 14942705 2380.109 19673017
## 3 16 1 FALSE 3559.638 16480775 2047.127 15988977
## 4 16 3 TRUE 3204.752 15391189 2373.358 19509361
## 5 22 3 FALSE 3268.191 16196849 2463.200 19979406
model.m$plot_hyperparam_results()
best.m <- model.m$summarize_best_hyperparams()
best.m
## reps hd allow.det.season
## 1 10 2 TRUE
final.ase.m <- dplyr::filter(res.m, reps == best.m$reps &
hd == best.m$hd &
allow.det.season == best.m$allow.det.season)[['ASE']]
final.ase.m
## [1] 13970241
# Final Model
caret_model.m = model.m$get_final_models(subset = 'a')
caret_model.m$finalModel
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1065689.5459.
# Plot Final Model
plot(caret_model.m$finalModel)
#Ensemble model
ensemble.mlp = mlp(tUScurrhop, xreg = tUSx, outplot = T, reps = 10, hd = 2, allow.det.season = T)
ensemble.mlp
## MLP fit with 2 hidden nodes and 10 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1,2,3)
## 1 regressor included.
## - Regressor 1 lags: (1,3)
## Forecast combined using the median operator.
## MSE: 1025155.9706.
#Plot ensemble model
plot(ensemble.mlp)
fore7.m = forecast(ensemble.mlp, xreg = new.regressor7, h=7)
#grabbing prediction intervals for 7 day forecast
all.mean.m7 <- data.frame(fore7.m$all.mean)
ranges.m7 <- data.frame(apply(all.mean.m7, MARGIN = 1, FUN = range))
subtracts.m7 <- ranges.m7 - as.list(ranges.m7[1,])
nintyperc.m7 <- data.frame(mapply(`*`,subtracts.m7,.9,SIMPLIFY=FALSE))
diffs.m7 <- data.frame(mapply(`/`,nintyperc.m7,2,SIMPLIFY = FALSE))
diffs.m7 = diffs.m7[-1,]
vector.m7 <- as.numeric(diffs.m7[1,])
plot(fore7.m)
lines(seq(133,139,1), (fore7.m$mean + vector.m7), type = "l", col = "red")
lines(seq(133,139,1), (fore7.m$mean - vector.m7), type = "l", col = "red")
fore90.m = forecast(ensemble.mlp, xreg = new.regressor90, h=90)
#grabbing prediction intervals for 90 day forecast
all.mean.m90 <- data.frame(fore90.m$all.mean)
ranges.m90 <- data.frame(apply(all.mean.m90, MARGIN = 1, FUN = range))
subtracts.m90 <- ranges.m90 - as.list(ranges.m90[1,])
nintyperc.m90 <- data.frame(mapply(`*`,subtracts.m90,.9,SIMPLIFY=FALSE))
diffs.m90 <- data.frame(mapply(`/`,nintyperc.m90,2,SIMPLIFY = FALSE))
diffs.m90 = diffs.m90[-1,]
vector.m90 <- as.numeric(diffs.m90[1,])
plot(fore90.m)
lines(seq(133,222,1), (fore90.m$mean + vector.m90), type = "l", col = "red")
lines(seq(133,222,1), (fore90.m$mean - vector.m90), type = "l", col = "red")
ASEmlr7 = mean((us$hospitalizedCurrently[126:132]-fore7.m$mean)^2)
ASEmlr7
## [1] 2174937
ASEmlr90 = mean((us$hospitalizedCurrently[43:132]-fore90.m$mean)^2)
ASEmlr90
## [1] 2239813584
We started our multivariate analysis using multiple regression with correlated errors. We produced two models, one with a trend and one without. We predicted that trend (or time) would be the deciding factor in which of these models outperformed the other, and we expected the model with trend to be the better one. However, when we compared the two via their AICs, we found the MLR model without trend performed better, both on AIC and when we compared the ASEs between the two model types. Applying our domain knowledge, we concluded that time would not necessarily be a strong determinant of COVID severity, since we have yet to observe, and subsequently analyze, a full cycle of the virus's effect on the US population. Once we are able to analyze data at that level, we would expect a time correlation, and therefore a lag/trend model, to perform better in predicting the virus's severity and therefore hospitalized patients.
Next we completed a Vector AR (VAR) model analysis of our data. This modeling process requires the data of both the independent and dependent variables to be stationary, which means we are actually modeling the differenced data. As a result, the predictions/forecasts are produced in differences, and the forecasts have to be reconstructed from previous periods' values. This modeling technique is highly sensitive to previously observed values, which in the case of COVID is essential for understanding its severity. Since we have yet to observe a full cycle, the modeling for predicting hospitalized cases should be closely based on what we've previously observed. So far this is the best of the three models we've produced.
For our final two models we fit multilayer perceptron (neural network) models. For the first model we used mostly default hyperparameters. The ASEs returned with cross validation are much higher than for the VAR model; we see ASEs in the billions for the 90-day window forecast. This is expected, since the NN model creates multiple scenarios and averages them to forecast forward, meaning even the highest and lowest paths are incorporated into the forecast. While some of the scenarios are not statistically likely, the NN model is attempting to create a forecast that takes these possibilities into account. To build a better neural network we created a hyperparameter-tuned model, which produced ASEs much lower than the default-parameter neural network model. The forecasts for the hyper-tuned NN model are slightly less dispersed than those of the default NN model, telling us that these predictions are more useful. Tuning the additional parameters allows the model to come closer to predicting future hospitalized cases. We performed windowed ASEs in order to pick the best hyperparameters for our advanced neural network model.
Upon reviewing our 90-day forecasts and ASEs, we concluded it would not be statistically sound to run a 90-day forecast and allow it to drive long term decisions until we have more data. Performing 90-day forecasts forces us to train the model on 20% of the data and test it on 80%, the exact opposite of a statistically sound time series setup for prediction. For our COVID severity predictions we advise producing forecasts and ASEs only for 7-day horizons, but we have produced the 90-day ASEs for reference below. This applies to all of our models' cross validation efforts and ASE calculations.
We have chosen two models to help us predict severity via COVID hospitalizations. We will leverage the vector AR model for short term forecasts. This model allows us to base our predictions on previously observed values. Since COVID has not completed a full cycle and the realization shows heavy wandering behavior, we feel this method will get us closest for our 7-day forecasts and short term planning of hospital supplies and staffing, with prediction intervals to make sure we account for possible peaks and valleys in our forecast.
For our long term forecasting we have chosen our hyperparameter-tuned neural network model. This model is helpful for long term forecasting because it can adjust and refit the most effective model as the newest data arrives with daily updates. We've calculated 90% confidence intervals based on the range of the forecasted possibilities. We will continue to calculate all of the probable outcomes and produce a mean forecast on which to base all planning. The hyperparameter-tuned neural network model takes multiple possibilities into account and therefore gives us the most statistically useful forecast for our 90-day period.
It is essential that both of these models be updated daily. COVID cases and hospitalizations change frequently, and the only way to keep forecasting and preparing as effectively as possible is to keep the models updated with as much historical data as possible. As we move forward with presenting the models we've decided on, it is important to remember that these only appear to be the most useful models; while we can work from them, none of them will be exactly correct.
#VAR 7-Day Forecasts w/ Prediction Intervals
plot(seq(1,132,1), us$hospitalizedCurrently, type = "l", xlim = c(0,139), ylim = c(0,62000), ylab = "Currently Hospitalized COVID Patients", main = "7 Day Currently Hospitalized Patients Forecast")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$fcst, type = "l", col = "red")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$upper, type = "l", col = "blue")
lines(seq(133,139,1), as.data.frame(currHospForecasts7)$lower, type = "l", col = "blue")
#Hyper Tuned Parameter Neural Network Model Forecast w/ Prediction Intervals
plot(fore90.m)
lines(seq(133,222,1), (fore90.m$mean + vector.m90), type = "l", col = "red")
lines(seq(133,222,1), (fore90.m$mean - vector.m90), type = "l", col = "red")
CA <- read.csv("https://raw.githubusercontent.com/JaclynCoate/6373_Time_Series/master/TermProject/Data/CA_COVID_7.16.20.csv", header = T)
#Re-format Date
CA$date <- as.Date(CA$date, format = "%m/%d/%Y")
head(CA)
## date newtested testedtotal newcountconfirmed totalcountconfirmed
## 1 2020-03-18 6291 6291 675 675
## 2 2020-03-19 5196 11487 331 1006
## 3 2020-03-20 1041 12528 218 1224
## 4 2020-03-21 939 13467 244 1468
## 5 2020-03-22 850 14317 265 1733
## 6 2020-03-23 1237 15554 222 1955
## newpospercent pospercent_14dayavg newcountdeaths totalcountdeaths
## 1 10.7 10.7 15 15
## 2 6.4 8.8 3 18
## 3 20.9 9.8 4 22
## 4 26.0 10.9 4 26
## 5 31.2 12.1 9 35
## 6 17.9 12.6 2 37
## hospitalized_covid_confirmed_patients
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## hospitalized_suspected_covid_patients hospitalized_covid_patients
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## all_hospital_beds icu_covid_confirmed_patients
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## icu_suspected_covid_patients icu_available_beds
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
#Cumulative total Cases
ggplot(data=CA, aes(x=date, y=totalcountconfirmed, group=1)) +
geom_line(color="gold") + ggtitle("Cumulative Total COVID-19 Confirmed Cases in CA") +
scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()
#Cumulative total Tests
ggplot(data=CA, aes(x=date, y=testedtotal, group=1)) +
geom_line(color="green2") + ggtitle("Cumulative Total COVID-19 Tests in CA") +
scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()
#Cumulative total Deaths
ggplot(data=CA, aes(x=date, y=totalcountdeaths, group=1)) +
geom_line(color="darkred") + ggtitle("Cumulative Total COVID-19 Related Deaths in CA") +
scale_x_date(date_labels = "%b") + xlab("") + ylab("Count")+ theme_fivethirtyeight()
As stated in the analysis of the US data, the number of hospitalized patients is a valuable metric for determining the impact of the virus over time, as it represents the primary driving factor for policy making. The “hospitalized_covid_patients” field was not completely populated for the time frame in which both confirmed and suspected hospitalization data was available, so a new variable is created that calculates total hospitalized COVID patients (confirmed + suspected).
Totalhosp <- (CA$hospitalized_covid_confirmed_patients + CA$hospitalized_suspected_covid_patients)
colors <- c("Confirmed COVID Patients" = "red", "Confirmed + Suspected COVID Patients" = "orange")
ggplot(data=CA) +
geom_line(data=CA,aes(x=date, y=hospitalized_covid_confirmed_patients, color="Confirmed COVID Patients")) +
geom_line(data=CA,aes(x=date, y=Totalhosp, color="Confirmed + Suspected COVID Patients")) +
ggtitle("Hospitalized COVID-19 Patients in CA") +
scale_x_date(date_labels = "%b") + labs(x="", y="Patients", color = "") +scale_color_manual(values = colors) +
theme_fivethirtyeight()
Because the availability of tests has increased over time, using the total hospitalized patients (confirmed + suspected) might be the best representation even though there is a possibility that some suspected cases may not actually be COVID-related.
How does hospitalization compare to new cases? New cases might be valuable in predicting hospitalization, or the relationship between them could be informative in terms of the impact of the virus. Based on the plot below, there is an interesting pattern of the cases curve making its way above the hospitalized patients curve, but it is reasonable to assume that hospitalizations currently rise as the number of daily new cases rises.
colors <- c("Confirmed COVID Hospital Patients" = "red", "New positive cases" = "orange")
ggplot(data=CA) +
geom_line(data=CA,aes(x=date, y=hospitalized_covid_confirmed_patients, color="Confirmed COVID Hospital Patients")) +
geom_line(data=CA,aes(x=date, y=newcountconfirmed, color="New positive cases")) +
ggtitle("Hospitalized COVID-19 Patients vs New Cases in CA") +
scale_x_date(date_labels = "%b") + labs(x="", y="Patients", color = "") +scale_color_manual(values = colors) +
theme_fivethirtyeight()
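One way to quantify this lead/lag relationship is the sample cross-correlation between the two series. A hedged sketch using base R's `ccf` (simulated stand-ins are used here so the snippet runs on its own; on the real data the inputs would be `CA$newcountconfirmed` and `Totalhosp`):

```r
# Cross-correlation between "new cases" and "hospitalizations", where the
# simulated hospitalizations trail the simulated cases by 7 days.
set.seed(42)
sim   <- as.numeric(arima.sim(n = 126, list(ar = 0.9)))
cases <- sim[8:126]                        # stand-in for daily new cases
hosp  <- sim[1:119] + rnorm(119, 0, 0.3)   # trails cases by one week
cc <- ccf(cases, hosp, lag.max = 21, plot = FALSE)
best_lag <- cc$lag[which.max(cc$acf)]      # negative: cases lead hospitalizations
```

A clearly negative `best_lag` would support using recent new cases as a leading predictor of hospitalizations, which is what the multivariate models later in this analysis do.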
parzen.wge(Totalhosp)
There are no peaks at non-zero frequencies on the spectral density plot that are particularly revealing. We may be interested later in only modeling the most recent upward trend (late June to present), so we can also check for any underlying cycles in that portion of the series by itself.
Totalhosp2 <- Totalhosp[90:119] #This is only the hospitalized patients data from late June to present
parzen.wge(Totalhosp2)
There does not appear to be any cyclic behavior hiding in this time frame either.
acf(Totalhosp,lag.max = 50)
The ACF plot is just a reflection of what the realization showed us: slow damping at the beginning due to the increasing trend, then flattening out for a couple of months before increasing again.
For a stationary model, we can start by trying the top two recommendations from the aic5.wge function.
aic5.wge(Totalhosp)
## ---------WORKING... PLEASE WAIT...
##
##
## Five Smallest Values of aic
## p q aic
## 7 2 0 11.98731
## 6 1 2 11.99646
## 10 3 0 12.00066
## 8 2 1 12.00122
## 5 1 1 12.00350
#aic5.wge(Totalhosp, type = "bic") this also resulted in recommending AR2
The top two models were an AR(2) and an ARMA(1,2).
#AR(2)
Hosp_est <- est.ar.wge(Totalhosp, p=2, type = "burg")
##
## Coefficients of Original polynomial:
## 1.2538 -0.2885
##
## Factor Roots Abs Recip System Freq
## 1-0.9502B 1.0525 0.9502 0.0000
## 1-0.3037B 3.2930 0.3037 0.0000
##
##
#ARMA(1,2)
Hosp_estq <- est.arma.wge(Totalhosp, p=1,q=2) #AR component has a root very close to 1
##
## Coefficients of Original polynomial:
## 0.9779
##
## Factor Roots Abs Recip System Freq
## 1-0.9779B 1.0226 0.9779 0.0000
##
##
The AIC of the two stationary models above:
Hosp_est$aic
## [1] 11.97533
Hosp_estq$aic
## [1] 11.99646
The AIC values for the AR(2) and ARMA(1,2) models are nearly identical. A rolling window ASE can be used to provide another comparison of the two models.
Model 1: AR(2)
trainingSize = 30
horizon = 7
ASEHolder = numeric()
for( i in 1:(119-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(Totalhosp[i:(i+(trainingSize-1))],phi = Hosp_est$phi, n.ahead = horizon)
ASE = mean((Totalhosp[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
ASEHolder
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
WindowedASE
## [1] 228959.4
Model 2: ARMA(1,2)
trainingSize = 30
horizon = 7
ASEHolder = numeric()
for( i in 1:(119-(trainingSize + horizon) + 1))
{
forecasts = fore.aruma.wge(Totalhosp[i:(i+(trainingSize-1))],phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = horizon)
ASE = mean((Totalhosp[(trainingSize+i):(trainingSize+ i + (horizon) - 1)] - forecasts$f)^2)
ASEHolder[i] = ASE
}
ASEHolder
hist(ASEHolder)
WindowedASE = mean(ASEHolder)
summary(ASEHolder)
WindowedASE
## [1] 178260.1
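The two loops above differ only in the model parameters passed to the forecaster, so the windowing logic can be factored into a helper. A minimal base-R sketch for the AR case (the function names `ar_forecast` and `rolling_ase` are ours; with tswge loaded you would call `fore.aruma.wge` inside the loop exactly as above, which also handles the MA terms):

```r
# Forecast an AR(p) model with fixed coefficients phi, reverting toward the
# mean of the window, similar to what fore.aruma.wge does for a pure AR model.
ar_forecast <- function(x, phi, mu = mean(x), h = 7) {
  dev <- x - mu
  for (k in 1:h) dev <- c(dev, sum(phi * rev(tail(dev, length(phi)))))
  tail(dev, h) + mu
}

# Rolling-window ASE: slide a window of trainingSize points, forecast the
# next `horizon` points, and record the mean squared error for each window.
rolling_ase <- function(x, phi, trainingSize = 30, horizon = 7) {
  sapply(1:(length(x) - trainingSize - horizon + 1), function(i) {
    f <- ar_forecast(x[i:(i + trainingSize - 1)], phi, h = horizon)
    mean((x[(i + trainingSize):(i + trainingSize + horizon - 1)] - f)^2)
  })
}
```

`mean(rolling_ase(Totalhosp, Hosp_est$phi))` would then approximate the windowed ASE for the AR(2) model, up to differences in how fore.aruma.wge centers the series.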
The ARMA(1,2) model has a lower ASE when using the windowed ASE method to compare the two models. The windowed ASE might be a better indicator because the behavior of the data changes over the course of the realization. This model is also appealing because it increases for a couple of points before beginning its decline.
#one week forecast of ARMA(1,2) model
fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 7, lastn = FALSE, limits = FALSE)
#3 month Forecast of ARMA(1,2) model
fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 90, lastn = FALSE, limits = FALSE)
The parameters of the ARMA(1,2) model:
Hosp_estq$phi #phi coefficients
## [1] 0.9779335
Hosp_estq$theta #theta coefficients
## [1] -0.2919636 -0.1422741
Hosp_estq$avar #white noise variance estimate
## [1] 151635.6
mean(Totalhosp) #mean
## [1] 4722.504
What this means to us: stationary models like these are useful only if we believe the count is going to (almost) immediately begin declining toward the mean of the data we have (the ARMA(1,2) model has a slight increase first). Whether an imminent decline is likely is anyone’s guess, but based on the apparent short-term trend, we might not consider it realistic for a 7 day forecast. What we do know is that the count has to come down at some point, and for a 90 day forecast this model offers a guess that incorporates our assumption of an eventual decrease. It is not unreasonable to assume that the number of hospitalizations will make its way down to the average of the previous few months before eventually falling further.
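The speed of that mean reversion can be read directly from the AR coefficient: beyond the first two steps (the MA order), an ARMA(1,2) forecast's deviation from the mean decays geometrically at rate phi1. A quick back-of-the-envelope check with the estimate reported above:

```r
# Fraction of the final deviation from the mean still present at horizon h,
# for the estimated AR coefficient phi1 = 0.9779 (from est.arma.wge above).
phi1 <- 0.9779
h <- c(7, 30, 90)
round(phi1^h, 3)  # roughly 0.86, 0.51, 0.13
```

So a 7 day forecast retains most of the current elevation, while the 90 day forecast has largely returned to the mean, consistent with the forecast plots above.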
Using an mlp to model hospitalizations could be useful for creating a forecast that is not based on the perceptions of the data we formed through the EDA; it might reveal behavioral possibilities we did not consider. The downside is that the mlp will not be “aware” of our expectations for the future, such as an inevitable decline. We can start with a univariate model using the default mlp function hyperparameters on the whole dataset and form an ensemble from the mean of 50 repetitions.
Totalhosp_ts <- ts(Totalhosp) #create time series object
fit_mlp_uni = mlp(Totalhosp_ts, reps = 50, comb = "mean") #mlp fit
fit_mlp_uni
## MLP fit with 5 hidden nodes and 50 repetitions.
## Series modelled in differences: D1.
## Univariate lags: (1)
## Forecast combined using the mean operator.
## MSE: 155666.1615.
plot(fit_mlp_uni) #model representation
fore_mlp_uni7 <- forecast(fit_mlp_uni, h=7) #univariate 1-week mlp forecast
plot(fore_mlp_uni7)
Somewhat unsurprisingly due to the fairly constant trend with little noise at the end of the data, the mlp formed models that were essentially linear.
fore_mlp_uni90 <- forecast(fit_mlp_uni, h=90) #univariate 3 month mlp forecast
plot(fore_mlp_uni90)
The 90 day forecast shows that there is no representation of our expected decrease in hospitalizations, so this model is likely only useful for a short term forecast.
Since all of the models used to form the ensemble appeared linear, there is little reason to believe that changing the hyperparameters will have much, if any, effect. We nevertheless attempted it to see if any models could be generated that declined back toward the mean, but the mlp did not accomplish this.
library(tswgewrapped)
data_train.u <- data.frame(TotalHosp_wase = Totalhosp[1:99], positiveIncrease = rnorm(99, 0, .0001))
data_test.u <- data.frame(TotalHosp_wase = Totalhosp[100:119], positiveIncrease = rnorm(20, 0, .0001))
# search for best NN hyperparameters in given grid
set.seed(1234)
model.u = tswgewrapped::ModelBuildNNforCaret$new(data = data_train.u, var_interest = "TotalHosp_wase", search = 'random', tuneLength = 5, parallel = TRUE, batch_size = 50, h = 7, verbose = 1)
## Aggregating results
## Selecting tuning parameters
## Fitting reps = 16, hd = 1, allow.det.season = FALSE on full training set
## - Total Time for training: : 23.955 sec elapsed
res.u <- model.u$summarize_hyperparam_results()
res.u
## reps hd allow.det.season RMSE ASE RMSESD ASESD
## 1 10 2 TRUE 283.6218 115188.8 199.2771 159657.8
## 2 11 3 FALSE 281.9886 111393.4 190.8652 145727.3
## 3 16 1 FALSE 279.7952 108781.1 186.6877 142143.2
## 4 16 3 TRUE 282.1774 112411.7 193.5756 150470.4
## 5 22 3 FALSE 279.5324 109693.8 189.9035 144260.2
best.u <- model.u$summarize_best_hyperparams()
best.u
## reps hd allow.det.season
## 3 16 1 FALSE
The ASE of the model using these hyperparameters is shown below:
final.ase.u <- dplyr::filter(res.u, reps == best.u$reps &
hd == best.u$hd &
allow.det.season == best.u$allow.det.season)[['ASE']]
final.ase.u
## [1] 108781.1
# Final Model
caret_model.u = model.u$get_final_models(subset = 'a')
caret_model.u$finalModel
## MLP fit with 1 hidden node and 16 repetitions.
## Univariate lags: (1,2)
## Forecast combined using the median operator.
## MSE: 170416.9004.
#Plot Final Model
plot(caret_model.u$finalModel)
mlp_uni2_90 <- mlp(Totalhosp_ts, reps = 16, hd = 1, allow.det.season = FALSE)
fore_mlp_uni2_7 <- forecast(mlp_uni2_90 , h=7) #univariate 7 day mlp forecast
plot(fore_mlp_uni2_7)
After using the tool to tune the hyperparameters, the forecast produced from the model was nearly the same in appearance and simplicity.
Again, it is essentially impossible to say which model is “better” because our expectations for the future do not necessarily reflect the past behavior of the data. Without some form of biased input, generated models are not going to take into account what we expect to happen in the long term. For a short term model, the mlp had the lowest ASE, so we could call it the “best” based on that metric.
The ASE was calculated for each model by comparing 7-day forecasts to the last 7 points of the data. All multivariate models used the 7-day average of new positive cases as a predictor of hospitalized patients. The calculated values for each model’s ASE are as follows:
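As a reference for how this comparison works, the holdout ASE is just the mean squared error of a model's 7-step forecast against the last 7 observed points. A minimal sketch (the helper names are ours):

```r
# Holdout ASE: forecast the last h points from the preceding data and
# compare with what was actually observed.
holdout_ase <- function(x, forecast_fn, h = 7) {
  train  <- head(x, length(x) - h)
  actual <- tail(x, h)
  mean((actual - forecast_fn(train, h))^2)
}

# Example with a naive "repeat the last value" forecaster:
naive_fn <- function(train, h) rep(tail(train, 1), h)
```

For the real comparison, `forecast_fn` would wrap `fore.arma.wge`, the mlp forecasts, or the MLR predictions.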
Traditional methods for generating time series models for forecasting actually have very limited usefulness in this particular scenario with hospitalized COVID-19 patients in California. Our analysis operated under the assumption that the number of cases and hospitalizations, regardless of any current trend, must eventually decline back to (nearly) zero if COVID-19 is a novel virus. At the current time, hospitalizations are increasing in California, so non-stationary models generated from past data continue this trend. Since we believe that a constantly (or, in the case of some of the models seen in this analysis, exponentially) increasing trend is unlikely to continue for the next three months, none of these non-stationary models can be used for forecasting more than a few days ahead. We’ve deemed this assumption reasonable based solely on our opinion formed from the past trends we’ve witnessed.
It is for this reason that the best model we have from this analysis for a three month forecast is the stationary ARMA(1,2) model that drives the forecast back toward the mean of the data we have.
For a one week forecast, the univariate MLP had the lowest ASE overall, and the multiple linear regression model using the last 7-day average of new positive cases had the lowest ASE of the multivariate models. Because of the volatility of the future behavior of this data, we don’t think the model with the lowest ASE is necessarily the best for forecasting. One advantage of the MLR model is that by using the average of the last 7 days of new cases as a predictor, along with the current trend, we get a good blend of weighing future changes in hospitalizations by both past behavior and the number of new cases. We learned from this model that there is a correlation between these two variables, but we could see in their plot that new cases went from being below the hospitalized COVID patients curve to crossing over and being above it. This could indicate that the number of new cases may become a poor predictor in the future, and we touched on this as part of our decision not to treat new cases as the best indicator of virus severity. However, for the next seven days we would like to use the model that includes this predictor, so we’ll go with the MLR model.
3 month Forecast with ARMA(1,2) model
fq <- fore.arma.wge(Totalhosp, phi = Hosp_estq$phi, theta = Hosp_estq$theta, n.ahead = 90, lastn = FALSE, limits = TRUE)
7 Day Forecast with MLR Model
plot(seq(1,108,1), Totalhosp1, type = "l",xlim = c(0,115),ylim=c(4000,9000), xlab="days", ylab = "COVID-Related Hospitalized Patients", main = "7 Day Forecast - Linear Regression with Corr Errors Model")
lines(seq(109,115,1), f_mlr2$pred, type = "l", col = "red")
lines(seq(109,115,1), (f_mlr2$pred+f_mlr2$se), type = "l", col = "red",lty = 2)
lines(seq(109,115,1), (f_mlr2$pred-f_mlr2$se), type = "l", col = "red",lty = 2)